In this section, we list several works that modify the structure of BNNs to achieve better performance or convergence. XNOR-Net and Bi-Real Net make minor adjustments to the original networks, while MCN proposes new filters and convolutional operations. The loss function is also adjusted according to these new filters, as will be introduced in Section 1.1.5.

1.1.5 Loss Design

During neural network optimization, the loss function measures the difference between a model's predicted values and the ground truth. Classical loss functions, such as the least squares loss and the cross-entropy loss, are widely used in regression and classification problems. This section reviews the specific loss functions designed for BNNs.
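As a point of reference for the designs reviewed below, the two classical losses mentioned above can be written in a few lines. The following PyTorch sketch is purely illustrative and is not taken from any of the cited works.

```python
import torch
import torch.nn.functional as F

# Least squares (MSE) loss for a regression output.
pred = torch.randn(8, 1)             # predicted values
target = torch.randn(8, 1)           # real (ground-truth) values
mse_loss = F.mse_loss(pred, target)

# Cross-entropy loss for a 10-class classification output.
logits = torch.randn(8, 10)          # unnormalized class scores
labels = torch.randint(0, 10, (8,))  # ground-truth class indices
ce_loss = F.cross_entropy(logits, labels)

print(mse_loss.item(), ce_loss.item())
```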

MCNs [236] propose a novel loss function that considers filter loss, center loss, and softmax loss in an end-to-end framework. The loss function in MCNs consists of two parts:
$$\mathcal{L} = \mathcal{L}_M + \mathcal{L}_S. \tag{1.13}$$
The first part, $\mathcal{L}_M$, is:

$$\mathcal{L}_M = \frac{\theta}{2}\sum_{i,l}\big\|C_i^l - \hat{C}_i^l \circ M^l\big\|^2 + \frac{\lambda}{2}\sum_{m}\big\|f_m(\hat{C},\vec{M}) - \bar{f}(\hat{C},\vec{M})\big\|^2, \tag{1.14}$$

where $C$ denotes the full-precision weights, $\hat{C}$ the binarized weights, and $M$ the M-Filters defined in Section 1.1.4; $f_m$ denotes the feature map of the last convolutional layer for the $m$th sample, and $\bar{f}$ denotes the class-specific mean feature map over previous samples. The first term of $\mathcal{L}_M$ is the filter loss, while the second term computes the center loss; $\mathcal{L}_S$ is a conventional loss function, such as the softmax loss.
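A minimal sketch of how $\mathcal{L}_M$ in Eq. (1.14) could be evaluated is given below, assuming per-layer lists of full-precision weights, binarized weights, and M-Filters with broadcast-compatible shapes, together with last-layer feature maps and per-class mean feature maps. The function name and tensor layout are hypothetical and simplify the actual MCN implementation.

```python
import torch

def mcn_loss_m(C, C_hat, M, feats, class_means, labels, theta=1e-3, lam=1e-3):
    """Sketch of L_M in Eq. (1.14): filter loss + center loss.

    C, C_hat, M : lists of per-layer tensors (full-precision weights,
                  binarized weights, M-Filters), broadcast-compatible shapes.
    feats       : (num_samples, feat_dim) last-layer feature maps f_m.
    class_means : (num_classes, feat_dim) class-specific mean feature maps.
    labels      : (num_samples,) class index of each sample.
    """
    # Filter loss: how well the modulated binarized weights reconstruct C.
    filter_loss = sum(((c - c_hat * m) ** 2).sum()
                      for c, c_hat, m in zip(C, C_hat, M))

    # Center loss: pull each feature map toward its class-specific mean.
    center_loss = ((feats - class_means[labels]) ** 2).sum()

    return theta / 2 * filter_loss + lam / 2 * center_loss
```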

PCNNs [77] propose a projection loss for discrete backpropagation and are the first to define the quantization of the input variable as a projection onto a set, from which the projection loss is derived. Our BONNs [287] propose a Bayesian-optimized 1-bit CNN model to significantly improve the performance of 1-bit CNNs. BONNs incorporate the prior distributions of full-precision kernels, features, and filters into a Bayesian framework to construct 1-bit CNNs comprehensively, in an end-to-end manner. Denoting the quantization error by $y$ and the full-precision weights by $x$, BONNs maximize $p(x|y)$ to optimize $x$ for quantization so that the reconstruction error is minimized. Since the distribution of $x$ is known, $p(x|y) \propto p(y|x)\,p(x)$, and the problem can be solved as maximum a posteriori (MAP) estimation. Feature quantization is handled in the same way. Therefore, the Bayesian loss is as follows:

$$\begin{aligned}
\mathcal{L}_B = {} & \frac{\lambda}{2}\sum_{l=1}^{L}\sum_{i=1}^{C_o^l}\sum_{n=1}^{C_i^l}\Big\{\big\|\hat{k}_n^{l,i} - w^l \circ k_n^{l,i}\big\|_2^2 \\
& \quad + \nu\,\big(k_n^{l,i+} - \mu_i^{l+}\big)^{T}\big(\Psi_i^{l+}\big)^{-1}\big(k_n^{l,i+} - \mu_i^{l+}\big) \\
& \quad + \nu\,\big(k_n^{l,i-} - \mu_i^{l-}\big)^{T}\big(\Psi_i^{l-}\big)^{-1}\big(k_n^{l,i-} - \mu_i^{l-}\big) - \nu\log\big(\det(\Psi^{l})\big)\Big\} \\
& + \frac{\theta}{2}\sum_{m=1}^{M}\Big\{\|f_m - c_m\|_2^2 + \sum_{n=1}^{N_f}\big[\sigma_{m,n}^{-2}(f_{m,n} - c_{m,n})^2 + \log\big(\sigma_{m,n}^{2}\big)\big]\Big\},
\end{aligned} \tag{1.15}$$
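The $\theta$-weighted part of Eq. (1.15) follows from the Gaussian assumption on features around their class centers: the negative log-likelihood yields the quadratic terms weighted by $\sigma^{-2}$ plus the $\log\sigma^2$ regularizer. The sketch below illustrates only this feature term; the function name, tensor layout, and per-dimension variances are assumptions for illustration rather than the actual BONN implementation.

```python
import torch

def bonn_feature_loss(feats, centers, log_var, labels, theta=1e-3):
    """Sketch of the theta-weighted feature term of the Bayesian loss, Eq. (1.15).

    feats    : (M, N_f) last-layer features f_m.
    centers  : (num_classes, N_f) class centers c_m.
    log_var  : (num_classes, N_f) learned log-variances log(sigma^2_{m,n}).
    labels   : (M,) class index of each sample.
    """
    c = centers[labels]        # c_m for each sample
    lv = log_var[labels]       # log(sigma^2_{m,n}) for each sample
    diff_sq = (feats - c) ** 2

    # ||f_m - c_m||_2^2 + sum_n [ sigma^{-2}(f_{m,n} - c_{m,n})^2 + log sigma^2 ]
    loss = diff_sq.sum() + (diff_sq * torch.exp(-lv) + lv).sum()
    return theta / 2 * loss
```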